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Abstract 

While the previous chapter (Robert and Rousseau, 
2010) addressed the foundational aspects of Bayesian 
analysis, the current chapter details its practical as- 
pects through a review of the computational meth- 
ods available for approximating Bayesian procedures. 
Recent innovations like Monte Carlo Markov chain, 
sequential Monte Carlo methods and more recently 
Approximate Bayesian Computation techniques have 
considerably increased the potential for Bayesian ap- 
plications and they have consequently opened new 
avenues for Bayesian inference, first and foremost 
Bayesian model choice. 

Keywords: Bayesian inference, Monte Carlo meth- 
ods, MCMC algorithms, Approximate Bayesian 
Computation techniques, adaptivity, latent variables 
models, model choice. 



1 Introduction 

The previous chapter (Robert and Rousseau, 2010) 
has (hopefully) stressed the unique coherence of 
Bayesian data analysis — the complete inferential 
spectrum (estimators, predictors, tests, confidence 
regions, etc.) is derived from a unique perspective, 
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once both a loss function and a prior distribution are 
constructed — , but it has not adressed the complex 
issues related to the practical implementation of this 
analysis that usually involves solving integration, op- 
timisation and implicit equation problems, most of- 
ten simultaneously. 

This computational challenge offered by Bayesian 
inference has led to a specific branch of Bayesian 
statistics concerned with these issues, from the early 
approximations of Laplace to the numerical prob- 
ability developments of the current days. In par- 
ticular, the past twenty years have witnessed a 
tremendous surge in computational Bayesian statis- 
tics, due to the introduction of powerful approxima- 
tion methods like Markov chain (MCMC) and se- 
quential Monte Carlo techniques. To some extent, 
this branch of Bayesian statistics is now so intricately 
connected with Bayesian inference that some notions 
like Bayesian model choice and Bayesian model com- 
parison hardly make sense without it. 

The probabilistic nature of the objects, involved in 
those computational challenges, as well as their po- 
tcntialy high dimension, led the community to opt 
for simulation based, rather than numerical, solu- 
tions. While numerical techniques are indeed used 
to solve some optimisation or some approximation 
setups, even producing specific approaches like varia- 



tional Bayes ( Jaakkola and Jordan 2000 ), the method 



of choice is simulation, i.e. essentially the use of com- 
puter generated random variables and the reliance 
on the Law of Large Numbers. For instance, all ma- 
jor softwares that have been built towards Bayesian 
data analysis like WinBUGS and JAGS, are entirely 
depending upon simulation approximations. We will 
therefore abstain from describing any further the nu- 
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merical advances found in this area, referring the 



reader to |Spall| (|2003j> and |Gentle| (|2009j> for proper 
coverage. 

In this chapter, we thus discuss simulated-based 
computational methods in connection with a few 
model choice examples (Section [2]), separating non- 
Markovian (Section [3]) from Markovian (Section [4j 
solutions. For detailed entries on Bayesian com- 
putational statistics, we refer the reader to [Chen 



et al. (2000), Liu (2001) or Robert and Casella 



2004 



20091), pointing out that books likelAlbertl d2009b and 



Marin and Robert (20071 encompass both Bayesian 



inference and computational methodologies in a sin- 
gle unifying perspective. 



2 Computational difficulties 

In this section, we consider two particular types of 
statistical models with computational challenges that 
can only be processed via simulation. 



2.1 Generalised linear models 



Generalised linear models (McCullagh and Nelder 
1989) are extensions of the standard linear regres- 
sion model. In particular, they bypass the compul- 
sory selection of a single transformation of the data 
that must achieve the possibly conflicting goals of 
normality and linearity, goals which are imposed by 
the linear regression model but that are impossible 
to achieve for binary or count responses. 

Generalised linear models formalise the connection 
between a response variable y £ R and a vector x G 
MP of explanatory variables. They assume that the 
dependence of y on x is partly linear in the sense that 
the conditional distribution of y given x is defined in 
terms of a linear combination x T /3 of the components 
of x (x T being the transpose of x) , 



y|x,/3~/(y|x T /3). 

We use the notation y = {y\ , . 
of n responses and 



,y n ) for a sample 



X U X 12 

%ai x 32 



x = 



Xl 



Xn 



for the n x p matrix of corresponding explanatory 
variables, possibly with xn = . . . = x n i = 1 (y and 
x correspond to generic notations for single-response 
and covariate vectors, respectively). 

A generalized linear model is specified by two func- 
tions: 

(i) a conditional density / on y conditional on x 
that belongs to an exponential family and that 
is parameterized by an expectation parameter 
(i = /i(x) = E[y|x] and possibly a dispersion 
parameter <fi > that does not depend on x; 
and 

(ii) a link function k that relates the mean fx = /i(x) 
of / and the covariate vector, x, through fc(/i) = 
(x T /3), pew. 

For identifiability reasons, the link function k is a 
one-to-one function and we have 

E[y|x,/3,</,]=fc- 1 (x T /3) . 

We can thus write the (conditional) likelihood as 

n 

^(/3,0|y,X) = rj/( yi |x lT /3,0) . 

i=l 

In practical applications like econometrics or ge- 
nomics, p can be very large and even larger than the 
number of observations n. Bayesian data analysis on 
(3 and possibly (f> proceeds through the posterior dis- 
tribution of (/3, 0) given (X, y): 

n 

7T(/3, 0|X, y) oc J] / (y t |x' iT /3, 0) ttGMX) (1) 

i=i 

which is never available as a standard distribution 
outside the normal linear model. Indeed, the choice of 
the prior distribution n((3, </>|X) depends on the prior 
information available to the modeller. In cases when 
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1, we will use the default solution advocated 



Marin and Robert ( 2007 ) , namely the extension of 



Zellncr's g-prior that was originaly introduced for the 
linear model, as discussed in the previous chapter: 



/3|X~AM0,n(X T X) 



The motivation behind the factor n is that the infor- 
mation brought by the prior is scaled to the level of 
a single observation. Even this simple modeling does 
not avoid the computational issue of exploiting the 
posterior density ([!]). 

Example 1. A specific if standard case of gener- 
alised linear model for binary data is the probit model: 
y(fi) = {0, 1} and we have 

P(Y = l|x) = 1 - P(Y = 0|x) = $(x T /3) , 

where $ denotes the standard normal cumulative dis- 
tribution function. Under the g-prior tt(/3|X) pre- 
sented above, the corresponding posterior distribu- 
tion, proportional to 



tt(/3|X) n^x^f^-x^) 1 - 



(2) 



is available in closed form, up to the normalising con- 
stant, but is not a standard distribution and thus 
cannot be easily handled! 

In this chapter, we will use as illustrative data the 



Pima Indian diabetes study available in R ( R Devel- 
opment Core Team|2008 ) as the Pima.tr dataset with 
332 women registered and consider a probit model 
predicting the presence of diabetes from three pre- 
dictors, the glucose concentration (glu), the diastolic 
blood (bp) pressure and the diabetes pedigree func- 
tion (ped), 

¥(y = l|x) = ${xxfa + x 2 (3 2 + x 3 /3 3 ) . 

A maximum likelihood estimate of the regression co- 
efficients is provided by R glm function as 

Deviance Residuals: 

Min 1Q Median 3Q Max 

-2.1347 -0.9217 -0.6963 0.9959 2.3235 
Coefficients : 



Estimate Std. Error z value Pr(>|z|) 
glu 0.012616 0.002406 5.244 1.57e-07 *** 
bp -0.029050 0.004094 -7.096 1.28e-12 *** 
ped 0.350301 0.208806 1.678 0.0934 . 

Signif. codes: '***> 0.001 '**' 0.01 '*' 0.05 



0.1 



Null deviance: 460.25 on 332 degrees of freedom 
Residual deviance: 386.73 on 329 degrees of freedom 
AIC: 392.73 

Number of Fisher Scoring iterations: 4 

the final column of stars indicating a possible sig- 
nificance of the first two covariates from a classical 
viewpoint. -4 

This type of model is characteristic of conditional 
models where there exist a plethora of covariates Xi — 
again, potentially more than there are observations — 
and one of the strengths of Bayesian data analysis is 
to be able to assess the impact of those covariates on 
the dependent variable y. This obviously is a special 
case of model choice, where a given set of covariates 
is associated with a specific model. As discussed in 
the previous chapter, the standard Bayesian solution 
in this setting is to compute posterior probabilities 
or Bayes factors for all models in competition. For 
instance, if a single covariate, x 3 (ped) say, is under 
scrutiny, the Bayes factor associated with the null 
hypothesis H : /3 3 = is 



B oi = 



ra (y) 
mi(y) 



(3) 



where mo and mi are the marginal densities under 
the null and the alternative hypotheses Q i.e. 



/(y |/3, X^tr 0(31X^/3, 



7To being the g-prior excluding the covariate x 3 . 

If we denote by Xo the 332 x 2 matrix containing 
the values of glu and bp for the 332 individuals and 
by Xi the 332 x 3 matrix containing the values of the 
covariates glu, bp and ped, the Bayes factor Bqi is 
given by 



*As will become clearer in Section |3.3[ Bayes factor approx- 
imations arc intrinsically linked with the normalising constants 
of the posterior distributions of the models under competition. 
We already stressed in Robert and Rousseau (2010) that this 
is a special case when normalising constants matter! 



3 



(2?r) 



1/2 1/2 I (XjXo) 



-1/2 



f. 

fm 3 



KXJ-X0I-V2 



using the shortcut notation that Ai t . is the i-th line 
of the matrix A. 

The approximation of those marginal densities, 
which are not available outside the normal model 
(see, e.g., Marin and Robert|2007 Chapter 3), is thus 
paramount to decide about the inclusion of available 
covariates. 

In this setting of selecting covariates in a condi- 
tional model, an additional and non-negligible com- 
putational difficulty is that the number of hypotheses 
to be tested is 2 P if each of the p covariates is under 
scrutiny. When p is large, it is simply impossible to 
envision all possible subsets of covariates and a fur- 
ther level of approximation must be accepted, namely 
that only the most likely subsets will be visited by an 
approximation method. 

2.2 Challenging likelihoods 

A further degree of difficulty in the computational 
processing of Bayesian models is reached when the 
likelihood function itself cannot be computed in a rea- 
sonable amount of time. Examples abound in econo- 
metrics, physics, astronomy, genetics, and beyond. 
The level of difficulty may be that the computation 
time of one single value of the likelihood function re- 
quires several seconds, as in the cosmology analysis 



of Wraith et al. (2009) where the likelihood is repre- 



sented by an involved computer program. It may also 
be that the only possible representation of the likeli- 
hood function is as an integral over a possibly large 
number of latent variables of a joint (unobserved) 
likelihood. 

Example 2. (Continuation of Example [lj Although 
the likelihood of a probit model is available in closed 
form, this probit model can be represented as a natu- 
ral latent variable model. If we introduce an artificial 



sample z = (z%, . . . ,z n ) of n independent latent vari- 
ables associated with a standard regression model, 
i.e. such that Zi\(3 ~ N (x lT /3, l) , where the x lT 's 
are the p-dimensional covariates and (3 is the vec- 
tor of regression coefficients, then y = (yi, ■ ■ ■ ,y n ) 
defined by = I 2i >o is a probit sample. Indeed, 
given /3, the y^s are independent Bernoulli rv's with 
¥{Y Z = l|x\/3) = $ (x iT /3). < 

Such latent variables models are quite popular in 
most applied fields. For instance, a stochastic volatil- 
ity model ( |Jacquier et""aL|l994[ |Chib et al.||2002[ ) in- 
cludes as many (volatility) latent variables as obser- 
vations. In a time series with thousands of periods, 
this feature means a considerable increase in the com- 
plexity of the problem, as the volatilities cannot be 
integrated analytically and thus need to be simulated. 
Similarly, phylogenetic trees (REF) that reconstruct 
ancestral histories in population genetics are random 
trees and a nuisance parameter for inference about 
evolutionary mechanisms, but, once more, they can- 
not be integrated. 

Example 3. Capture-recapture experiments are 
used in ecology to assess the size and the patterns of 
a population of animals by a series of captures where 
captured animals are marked, i.e. individualy identi- 
fied as having been captured once, and released. The 
occurence of recaptures is then informative about the 
whole population. A longer description is provided 
in Marin and Robert (2007, Chapter 5), but we only 
consider here a three stage open population capture- 
recapture model, where there is a probability q for 
each individual in the population to leave the pop- 
ulation between each capture episode. Due to this 
possible emigration of animals, the associated like- 
lihood involves unobserved indicators and we study 
here the case where only the individuals captured 
during the first capture experiment are marked and 
subsequent recaptures are registered. This model is 
thus described via the summary statistics 

ni~a(N t p), ri|m~^(ni,?), 

r 2 \ni,ri ~ SS(n x - n,q) , c 2 |ni,ri ~ @{n\ - n,p), 



and 



C3\ni,ri,r 2 ~ SS{n x -t\- r 2 ,p) , 
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where only the first capture size, m, the first recap- 
ture size, C2, and the second recapture size, C3, are ob- 
served. The numbers of marked individuals removed 
at stages 1 and 2, n and r 2 , are not observed and are 
therefore latent variables of the model. If we incor- 
porate those missing variables within the parameters, 
the likelihood £(N,p, q, r±, r2\nx, C2, C3) is given by 

(^Wl-p)"- ni h) q ^(l-q) n ^ 
x ~ ri U C2 (l-p) ni - ri - C2 

x r* 1 ~ r M ? r2 (i- 9 ) ni - ri - r2 

and, if we use the improper prior n(N,p,q) = 



i\r- 1 i [0 ,i](p)i [ o,i](«), 



the posterior on the 



(N,p, q, n, ra I ?ii , C2, C3) is available up to a constant. 
Summing over all possible values of (7"i,r 2 ) to 
obtain the posterior associated with the "observed" 
likelihood creates some difficulties when n\ is large. 
Indeed, this summation typically introduces a lot of 
numerical errors. 

The dataset associated with this example is ex- 



tracted from Marin and Robert's (2007 Chapter 5) 
eurodip dataset and is related to a population of birds 
called European dippers^ For the 1981 captures, we 
have n\ = 22, c 2 = 11, and C3 = 6. 

The following example is a different case where the 
likelihood is missing a term that cannot be reconsti- 
tuted by completion and thus requires a custom-built 
solution. 

Example 4. The k -nearest-neighbour procedure is a 
classification procedure that uses a training dataset 
{Uii Xj)i<i<„ for prediction purposes. The observ- 
ables yi are class labels, yi € {1, . . . , G}, while the 
Xj are covariates, possible of large dimension. When 



2 European dippers are strongly dependent on streams, feed- 
ing on underwater invertebrates, and their nests are always 
close to water. The capture— recapture data contained in the 
eurodip dataset covers 7 years of observations in a zone of 200 
km 2 in eastern France. 



observing a new covariate x Il+ i, the corresponding 
unobserved label y n +i is predicted as the most com- 
mon class label found in the k nearest neighbours of 
x n+ i in X = {x 1; . . . ,x n }, the neighbours of a co- 
variate vector being defined by the usual Euclidean 
norm. Cucala et al. (2009) have proposed a proba- 



bilistic model for this classification mechanism. They 
first propose to model the distribution of y: 

f(y\X,/3,k) = 

exp E E S Vi W / fc ) I Z (P> fc ) W 

where 5 x (y) denotes the Kroenecker delta, Z(f3,k) is 
the normalising constant of the density and where 
£ i means that the summation is taken over the 
observations Xj for which x^ is a fc-nearest neighbour. 
The motivation for this modelling is that the full con- 
ditionals corresponding to Q are given by 




(5) 



The normalising constant Z(/3, k) cannot therefore be 
expressed in closed form. Indeed, the computation of 
this constant calls for a summation over G n terms. 

Based on ([5]), the predictive distribution of a new 
observation y n +i given its covariate x n+ i and the 
training sample (y, X) is, for g = 1, . . . , G, 

P(y n +i = g\x n +i,y, X, j3, k) cx 
exp</3/fc ^ 5 g(Vi)+ E M#) 

[ V~k(n+1) (n+l)~ fc £ 

where 

E S g(ve) and E Mff) 

£~ fc (n+l) (n+l)~ k t 

are the numbers of observations in the training 
dataset from class g among the k nearest neighbours 
of x Jl+ i and among the observations for which x n+ i 
is a fc-nearest neighbour, respectively 
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3 Monte Carlo Methods 

The generic approach for solving computational prob- 
lems related with Bayesian analysis is to use simula- 
tion, i.e. to produce via a computer program a sam- 
ple from the posterior distribution and to use the 
simulated sample to approximate the procedures of 
interest. This approach goes under the generic name 
of Monte Carlo methods, in reference to the casino 
of Monaco ( [Metropolis 1987). Recall that a stan- 
dard Bayesian estimate is the posterior expectation 
of functions h(9) of the parameter, 



h{0)w{0\y)d0. 



A formal Monte Carlo algorithm associated with the 
target 3 proceeds as follows: 



Basic Monte Carlo Algorithm 
For a computing effort N 

1) Set i = 1, 

2) Generate independent 0^ 1 ' from the posterior 
distribution 7r(-|y), 

3) Set i = i + 1, 

4) If i < N, return to 2). 



The corresponding crude Monte Carlo approxima- 
tion of 3 is given by: 



i N 

5MC = ^EK 0(i) ) 



When the computing effort N grows to infinity, the 
approximation 3 MC converges to 3 and the speed of 
convergence is 1/ y/~N if h is square-integrable against 
7r(0|y) ( Robert and Casella|2 004 ) . The assessment of 
this convergence relies on the Central Limit Theorem, 



as described in Robert and Casella (2009, Chapter 4) 



3.1 



Importance sampling and resam- 
pling 



A generalisation of the basic Monte Carlo algorithm 
stems from an alternative representation of the above 



integral 3, changing both the integrating density and 
the integrand: 



h(0)n(e\y) 
9(0) 



g(0)dO, 



(0) 



where the support of the posterior distribution 7r(-|y) 
is included in the support of g(-). 

Importance Sampling Scheme 
For a computing effort N 

1) Set i = 1, 

2) Generate independent 0^ from the importance 
distribution <?(-), 

3) Calculate the importance weight 

w« =7r(0«|y) /g (V^ 

4) Set i = i + 1, 

5) If i < N, return to 2). 



The corresponding importance sampling approxi- 
mation of 3 is given by 



1 



N 

N ^ 

i=l 



(V 4 >) 



(7) 



From a formal perspective, the posterior density 
g(0) = 7r(0|y) is a possible (and the most natural) 
choice for the importance function g(0), leading back 
to the basic Monte Carlo algorithm. However, ^ 
states that a single integral may be approximated in 
infinitely many ways. Maybe surprisingly, the choice 
of the posterior g(ff) — 7r(0|y) is generaly far from be- 
ing the most efficient choice of importance function. 
While the representation ^ holds in wide generality 
(the only requirement is that the support of 7r(-|x) 
should be included in the one of <?(•)), the choice of 
<?(•) is fundamental to provide good approximations 
of 3. Poor choices of g(-) lead to unreliable approxi- 
mations: for instance, if 



h 2 {0)u 2 {0)g{0)A0 
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is infinite, the variance of the estimator ^ is also 
infinite (Robert and Casella 2009 Chapters 3 and 
4) and then ([7j) cannot be used for approximation 
purposes. 

We stress here that, while Monte Carlo methods do 
not formaly suffer from the "curse of dimensionality" 
in the sense that, contrary to numerical methods, the 
error of the Monte Carlo estimators is always decreas- 
ing in 1/ yN, notwithstanding the dimension of the 
parameter space 0, the difficulty increases with the 
dimension p of O in that deriving satisfactory impor- 
tance sampling distributions becomes more difficult 
as p gets larger. As detailed in Section 3.2 



lution for deriving satisfactory importance functions 
in large dimensions is to turn to iterative versions of 
importance sampling. 

Example 5 (Continuation of Example [l}. In the 
case of the probit model, the posterior distribution, 
proportional to ([2]) cannot be easily simulated, even 
though it is bounded from above by the prior density 
7r(/3|X)0 

In this setting, we propose to use as importance 
distribution a normal distribution with mean equal 
to the maximum likelihood (ML) estimate of (3 and 
with covariance matrix equal to the estimated covari- 
ance matrix of the ML estimate. While, in general, 
those normal distributions provide crude approxima- 
tions to the posterior distributions, the specific case 
of the probit model shows this is an exceptionally 
good approximation to the posterior. For instance, 
if we compare the weights resulting from using this 
normal distribution with the weights resulting from 
using the prior distribution as importance function, 
the range of the former weights is much more concen- 
trated than for the later weights, as shown by Figure 
[I] (Note that, due to the missing normalising con- 
stant in 7r(/3|X,y), the weights are computed with 
the product ([2| as the target function.) -4 

As noted in the above example, a common feature 
of Bayesian integration settings is that the normalis- 
ing constant of the posterior distribution, m(y), can- 

3 This feature means that the accept-reject algorithm 
l |Robert and Casella|2009| Chapter 2) could formally be used 



for the simulation of 7r(/3|X, y), but the efficiency of this ap- 
proach would be quite poor. 



not be computed in closed form. In that case, cjW 
and cannot be used and they are replaced by the 
unormalised version 

u;W=m(y)7T(0 W |y)/2(0 W ) 
and by the self-normalized version 



-rSNIS 



N 
i=l 



, N 

5>« 



respectively. The self-normalized 3^ ms also con- 



verges to 3 since Y^iLi converges to the normal- 



a so- ising constant m(y). The weights (i = 1, . . . , T) 



5J« 



V 



JV 



are then called normalised weights and, since they 
sum up to one, they induce a probability distribution 
on the sample of #W's. When chosing an importance 
function, the adequation with the posterior distribu- 
tion needs to get higher as the dimension p increases. 
Otherwise, very few weights uj^ are different from 
and even the largest weight, which is then close to 
1, may correspond to an unlikely value for the pos- 
terior distribution, its closeness to 1 being then an 
artifact of the renormalisation and not an indicator 
of its worth. A related measure of performance of the 
importance function is given by the effective sample 



ESS 



N 



I N 

EM' 



1=1 



For a uniformly weighted sample, ESSat is equal to JV, 
while, for a completely degenerated sample where all 
importance weights but one are zero, ESSjv is equal 
to 1. The effective sample size thus evaluates the size 
of the iid sample equivalent to the weighted sample 
and allows for a direct comparison of samplers. 

Example 6 (Continuation of Example [5]). For the 

two schemes tested in the probit model of Example 
[5j using the same number N = 10, 000 of simulations, 
the effective sample sizes are T\ — 6291.45 and T2 = 
9.77 for the ML based and prior normal importance 
functions, respectively. < 
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Prior sampling 



Importance sampling 



7 ' 



importance sampling 



-700 



-500 



-210 



-195 



Figure 1: Boxplot and histograms of the logarithms of the importance weights corresponding to 10 4 simu- 
lations from the prior distribution (prior sampling) and from the MLE normal approximation (importance 
sampling) in the setup of the Pima Indian diabetes study of Example [l] The graphs for the prior sampling 
are excluding the 1690 zero weights from the representation. 



While importance sampling is primarily an integral 
approximation methodology, it can also be used for 
simulation purposes, via the sampling importance re- 
sampling (SIR) methodology of |Rubin] ( |1988[ ). Given 
a weighted sample (0 (1) , w«), ' . . , (0^\u;W) sim- 
ulated from (?(•), it is possible to derive a sample 
approximately distributed from the target distribu- 
tion 7r(-|x), 6 ,...,8^ \ by resampling from the 
instrumental sample 0^~\ . . . , 6^ N ' using the impor- 
tance weights, that is, 



Ki< M. 



where the random variables J\ , . . . , Jm are dis- 
tributed as 



(see, e.g., Robert and Casella||2009 Chapter 3) 



3.2 Sequential importance sampling 

In general, importance sampling techniques require a 
rather careful tuning to be of any use, especially in 
large dimensions. While MCMC methods (Section^ 
are a ready-made solution to this problem, given that 



they can break the global distribution into distribu- 
tions with smaller dimensions, the recent literature 
has seen an extension of importance sampling that 
adaptively calibrates some importance functions to- 
wards more similarity with the target density ( Cappe 



et al.||2004 


Del Moral et al.||2006 Douc et al.| 


2007a, 


Cappe et al.||2008 


1. 



The method is called sequential Monte Carlo 
(SMC) because it evolves along a time axis either 
through the target — as in regular sequential statis- 
tical problems — or through the importance function, 
and also population Monte Carlo (PMC), following 
|Iba| ( |2000[ ), because it produces populations rather 
than pointsr] Although the idea has connections 



with the earlier particle filter literature ( 


Gordon et al. 


1993[|Doucet et al. 2001 


Cappe et al. 2004 


) , the main 



principle of this method is to build a sequence of 
increasingly better — against a performance criterion 
that may be the entropy divergence from the tar- 
get distribution or the variance of the corresponding 
estimator of a fixed integral — proposal distributions 
through a sequence of simulated samples (which thus 
behave like populations). Given that the validation 



4 This simulated population is then used to devise new and 
hopefully improved importance (or proposals) functions. 
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of the technique is still based on sampling impor- 
tance resampling principles, the resulting dependence 
on past samples can be arbitrarily complex, while 
the approximation to the target remains valid (unbi- 
ased) at each iteration and while it does not require 
asymptotic convergence as MCMC methods do (see 
Section [4]) . A very recent connection between both 



are simulated either from a non-parametric kernel- 
like proposal of the form 



Bj,t- 

3=1 



K t (0 



8) 



where K t is a Markov kernel modified at each iter- 



approaches can be found in Andrieu et al. (2010) and 
the discussion therein. 

While the following algorithm does appear as a re- 
peated (or sequential) sampling importance resam- 
pling algorithm (Rubin 1988), the major update is 
the open choice of qu in the first step, since qu can 
depend on all past simulated samples as well as on the 
index of the currently simulated value. For instance, 
in 



ation (Douc et al. 2007b) or from a mixture of the 
form 



J2 

i=i 



i 9j(0j,t-i\tij,t-i) , 



where gj is a standard distribution from an expo- 
nential family parameterised by £, both parameters 



and weights being updated at each iteration ( Cappe 



Cappe et all (120081), mixtures of standard kernels |et al.| |2008). An illustration of the performances of 



are used with an update of the weights and of the 
parameters of those kernels at each iteration in such 
a way that the entropy distance of the correspond- 
ing importance sampling estimator to the target are 
decreasing from one iteration to the other. 

General Population Monte Carlo Algo- 
rithm 

For a computing effort N 

1) Generate (6>i,o)i<;<jv ~ qo and compute 

w 4 , = 7r(0i,o|y)/go(0i,o). 

2) Generate (J l:0 )i<i<jv ~ M(l, (S7i,o)i<i<jv) 
and set d ifi = 9, hofi (1 < i < N), 

3) Set t = 1, 

4) Conditionally on past 9 L j's and Oi.j's, generate 
independently 9 it ~ q i t and compute u>n — 

7r(0 <lt |y)/M0<,i), 

4) Generate [Ji,t)i<%<N ~ M(l, (U itt )i<i< N ) 
and set 9 lA = 0j M>t (1 < i < N), 

5) Set t = t+l, 

6) If t < N return to 4). 



this PMC algorithm for a cosmological target is given 
in Wraith et al. (2009), while an ABC extension has 



been introduced by Beaumont et al. ( 2009 ) 



Since PMC produces at each iteration a valid ap- 
proximation to the target distribution, the popula- 
tions 6i tt produced at each of those iterations should 
not be dismissed for approximation purposes. |Cor- 



nuet et al. (2009) have developped a nearly opti- 



mal strategy recycling all past simulations, based on 
the multiple mixture technique of |Owen and Zhou| 
( 2000[ ) and called adaptive multiple importance sam- 
pling (AMIS). 

3.3 Approximations of the Bayes fac- 
tor 

As already explained above, when testing for an null 
hypothesis (or a model) H : 9 € O against the 
alternative hypothesis (or the alternative model) Hi : 
9 € ©i, the Bayes factor is defined by 



B m (y) 



/o(y|0o)*-o(0o)d0o 



A(y|0iM0i)d0i 



8i 



In this representation, while the choice of qu is 
completely open, a convenient case is when the flat's 



The computation of Monte Carlo approximations of 
the Bayes factor ^ has undergone rapid changes in 



the last decade as illustrated by the book of Chen 



et al. ( 2000[ ) and the recent survey of Robert and 
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Marin (2010). We assume here that the prior dis- 



tributions under both the null and the alternative 
hypotheses are proper, as, typically they should be. 
(In the case of common nuisance parameters, a com- 
mon improper prior measure can be used on those, 
see |Berger et ah] (|1998[), |Marin and Robert| (|2007[). 



that lead to a significant reduction in the variance of 
the importance sampling estimate. 

In the special case when the parameter spaces 
of both models under comparison are identical, 



This complicates the computational aspect, as some 
methods like crude Monte Carlo cannot be used at 
all, while others are more prone to suffer from infi- 
nite variance.) In that setting, the most elementary 
approximation to -Boi(y) consists in using a ratio of 
two standard Monte Carlo approximations based on 
simulations from the corresponding priors. Indeed, 
for i = 0, 1: 



i.e. 0q = 0i, a bridge sampling approach (Meng and 
Wong|[l996 ) is based on the general representation 



S i(y) 



My\O)n l (0)d0 = E lri [/(y|0)] 



/o(y|0)7T O (0)a(0)7ri(%)d0 
/i(y|0)7n(0)a(0)7r o (0|y)d0 



n 1 - 1 ^/o(y|^)7ro(0i J )«(0 1 , J -) 
j=i 

no 

1 X;/i(y|0o,i)7r 1 (0o,i)a(0o I i) 



Then, if O i, • . . ,0on o and 0i i, . . . ,0i ni are two in- where o ,i 



dependent samples generated from the prior distribu- 
tions 7To and 7Ti, respectively, 



is a strongly consistent estimator of Bpi(y)- 

Defining two importance distributions with densi- 
ties (?o and g\ , with the same supports as ttq and 7Ti , 
respectively, we have: 

Boi(y) =E 3 „ [/o(y|0)7r o (0)/5o(0)] / 
E 91 [/i(y|0)^i(0)/gi(0)] . 

Therefore, given two independent samples gener- 
ated from distributions go and g±, respectively, 
0o,i> •■■■> e o,n and M , . . . , 6 ljni , the corresponding 
importance sampling estimate of _Boi(y) is 

n^^UHyWiMe^yg^j) ' 

Compared with the standard Monte Carlo approxi- 
mation above, this approach offers the advantage of 
opening the choice of the representation in that it is 
possible to pick importance distributions go and g\ 



i 0o, n and 0i,i, . . • , 0i, m are two inde- 
pendent samples coming from the posterior distribu- 
tions 7r o (0|y) and 7Ti(0|y), respectively. That applies 
for any positive function a such that the upper inte- 
gral exists. Some choices of a can lead to very poor 
performances of the method in connection with the 
harmonic mean approach (see below), but there ex- 
ists a quasi-optimal solution, as provided by German 
and MengMl998|): 



oe*(y) oc 



^o7ro(0|y) + rci7Ti(0|y) 



This optimum cannot be used per se, since it re- 
quires the normalising constants of both 7To(0|y) and 
7i"i(0|y). As suggested by Gelman and Meng (1998), 
an approximate but practical version uses iterative 
versions of a*, the current approximation of a* being 
used to produce a new bridge sampling approxima- 
tion of -Boi(y)j which in its turn is used to set a new 
approximation of a*. Note that this solution recy- 
cles simulations from both posteriors, which is quite 
appropriate since one model is selected via the Bayes 
factor, instead of using an importance weighted sam- 
ple common to both approximations. We will see be- 
low an alternative representation of the bridge factor 
that bypasses this difficulty (if difficulty there is!). 

Those derivations are however restricted to the 
case when both models have the same complexity and 



10 



thus they do not apply to embedded models, when 
0o C 81 in such a way that Q\ = (9,ip), i.e. when 
the submodel corresponds to a specific value ipo of ijj: 

fo{y\o) = fx{y\o,i*). 

The extension of the most advanced bridge sam- 
pling strategies to such cases requires the introduc- 
tion of a pseudo-posterior density, w(0|#,y), on the 
parameter that does not appear in the embedded 
model, in order to reconstitute the equivalence be- 
tween both parameter spaces. Indeed, if we augment 
7To(#|y) with uj(ip\6, y), we obtain a joint distribution 
with density n (0\y) x uj(%jj\d,y) on 81. The Bayes 
factor -Boi(y) can then be expressed as 



/ My\o, 4>o)M0)a(0, , V|y)d6MV|0, y) dV> 

/ / 1 (y|0,V)7ri(0,^)Q:(0,V)vro(6»|y)a J (t/.|0,y)d6)dV 

Je 1 



(8) 



because it is clearly independent from the choice of 
both a(6,ip) and w(ip\9, y). Obviously, the perfor- 
mances of the approximation 

(»i)~ l Z)/ 1 & r l fll .i.Vto)wo(ei,i)w(^wiei,i,y)«(eM,^i.j) 

j=i 

("o) -1 ^2 h(y\ O,j> 1 Po,j)^i(So,j,^O,j)a(e O j,ip O j) 



where 



(0O,lj ^0,1)1 ■ * * ) (^0,n i 00, n ) 



(^l,l,01,l),.-.,(^l,n 1 ,'01, 



arc 



two 



and 
inde- 
pendent samples generated from distributions 
7r o (0|y) x uj(ip\9,y) and ni(0,ip\y), respectively, do 
depend on this completion by the pseudo-posterior 



as well as on the function a(<?,0). Chen et al. (2000) 
establish that the asymptotically optimal choice for 
w(0|0,y) is the obvious one, namely 

w(V|0,y)=7riM0,y), 

which most often is unavailable in closed form (espe- 
cially when considering that the normalising constant 
of uj(tp\6,y) is required in Q). 

Another approach to approximating the marginal 
likelihood is based on harmonic means. If 6ij ~ TTi(-) 
(i = 1, 2, j = 1, . . . , A), the prior distribution, then 



1 N 

- y 



1 



n fri fi(y\e itj ) 



is an unbiased estimator of l/m^y) ( |Newton and 
Raftery 1994). This generic harmonic mean is too 



often associated with an infinite variance to ever be 



recommended (Neal 19941, but the representation 
dGelfand and Dey||1994[ ) (i = 0, 1) 



Vi(0) 



dO 



mi(y) 



holds, no matter what the density (fi is, provided 
<Pi(0i) = when 7Ti(0 4 )/i(y|0<) = 0. This repre- 
sentation is remarkable in that it allows for a direct 
processing of Monte Carlo (or MCMC) output from 
the posterior distribution ^i{0i\y). As with impor- 
tance sampling approximations, the variability of the 
corresponding estimator of i?oi(y) will be small if 
the distributions (fi (i = 0, 1) are close to the cor- 
responding posterior distributions. However, as op- 
posed to usual importance sampling constraints, the 
density ifi must have lighter — rather than fatter — 
tails than 7Ti(-)/i(y|') f° r the approximation of the 
marginal nii(x) 



N 



N 



when 9i j ~ iTi(6\y), to enjoy finite variance. For in- 
stance, using tfi's with constrained supports derived 
from a Monte Carlo sample, like the convex hull of 
the simulations corresponding to the 10% or to the 
25% HPD regions — that again is easily derived from 
the simulations — is both completely appropriate and 
implementable ( Robert and Wraith||2009 ). 

Example 7 (Continuation of Example [5]). In the 
case of the probit model, if we use as distributions 
ipi the normal distributions with means equal to the 
ML estimates and covariance matrices equal to the 
estimated covariance matrices of the ML estimates, 
the results of Robert and Marm| ( |2010 ), obtained 
over 100 replications with A = 20, 000 simulations 
each are reproduced in Figure [2j They compare 
both approaches — harmonic mean and importance 
sampling — to the approximation of the Bayes factor 
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Figure 2: Monte Carlo experiment comparing the 
variability of the approximations to the Bayes fac- 
tor £?io(y) based on harmonic mean and importance 
sampling for the Pima Indian diabetes study of Ex- 
ample [I] The boxplots are obtained for 100 repli- 
cations of 20, 000 simulations from the normal im- 



portance sampling distribution. (Source: Robert and 



Marin 2010). 



testing for the significance of the ped covariate and 
show a very clear proximity between both importance 
solutions in this special case, even though the impor- 
tance sampling estimate is much faster to compute. 
Simulation from the posterior distribution is obtained 
by an MCMC algorithm described in Section [4] < 

A final approach to the approximation of Bayes fac- 
tors that is worth exploring is Chib's (1995) method. 
First, it is a direct application of Bayes' theorem: 
given y ~ fi(y\9), we have that 



TOi(y) 



My\Q)^(Q) 



for all 0's (since both the lhs and the rhs of this 
equality are constant in 6). Therefore, if an arbitrary 
value 0* , is selected and if a good approximation to 



7Tj(0*|y) is available, denoted 7^(0* |y), Chib's (1995) 
approximation to the marginal likelihood (and hence 
to the Bayes factor) is 



TOi(y) 



/ i; (y|0*K(0*) 



In a general setting, TTi(0*\y) may be the normal ap- 
proximation based on the MLE, already used in the 
importance sampling, bridge sampling and harmonic 
mean solutions, but this is unlikely to be accurate 
in a general framework. A second solution is to use 
a nonparametric approximation based on a prelimi- 
nary MCMC sample, even though the accuracy may 
also suffer in large dimensions. In the special set- 
ting of latent variables models introduced in Section 
2.2 Chib's (1995) approximation is particularly at- 
tractive as there exists a natural approximation to 



7Tfc(0|y), based on the Rao-Blackwell (Gelfand and 
Smith] 1990| estimate 



(9) 



1 N 

3=1 

where the z.,-'s are the latent variables simulated by 
the MCMC sampler. The estimate 7r fc (#*|y) is indeed 
a parametric unbiased approximation of 7r^.(0*|y) 
that converges with rate 0(1 /vN), It obviously re- 
quires the full conditional density 7^(0* |y,z) to be 
available in closed form (constant included) but, for 
instance, this is the case for the probit model of Ex- 
ample [I] 

Example 8 (Continuatio n of Example [7]) ■ F i gure [3| 
reproduces the results of Robert and Marin ( 2010[ ) 
obtained for 100 replications of Chib's approxima- 
tions of i?oi(y) for the same test as in Example [7] 
with N = 20, 000 simulations for each approxima- 
tion of mi(y) (i = 0,1). While Chib's method is 
usually very reliable and dominates importance sam- 
pling, the incredibly good approximation provided by 
the asymptotic normal distribution implies that, in 
this highly special case, Chib's method is dominated 
by both the importance sampling and the harmonic 
mean estimates. < 

While the methods presented above cannot ranked 
in a fixed order for all types of problems, the con- 
clusion of Robert and Marin|2010| is worth repeating 
here. In cases when a good approximation <?(•) to 
the true posterior distribution of a model is avail- 
able, it should be used in a regular importance sam- 
pling evaluation of the marginal likelihood. Given 
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that this good fit rarely occurs in complex and new 
settings, more generic solutions like Chib's (1995) 
should be used, whenever available. (When used 
with a bounded support on the <fiS, the harmonic 
mean approximation can be considered as a generic 
method.) At last, when faced with a large number 
or even an infinity of models to compare, the only 
available solution is to use model jump techniques 
like reversible jump MCMC flGreen||T99"5"l). 



Figure 3: Monte Carlo experiment comparing the 
variability of the approximations to the Bayes factor 
Bio (y) based on Chib's representation and on har- 
monic mean and importance sampling for the Pima 
Indian diabetes study of Example[l] The boxplots are 
obtained for 100 replications of 20, 000 simulations 
from the posterior and the importance sampling dis- 
tributions, respectively. (Source: Robert and Marin 



2010) 



4 Markov chain Monte Carlo 
methods 

4.1 Basics 

Given the difficulties involved in constructing an effi- 
cient importance function in complex setting, Markov 



chain Monte Carlo (MCMC) methods ( jGelfand and 
Smith||19901 |Robert and Casella| [20041 



2009, Marin 



and Robert||2007[ ) try to overcome some of the limi- 



tations of regular Monte Carlo methods (particularly 
dimension-wise) by simulating a Markov chain with 
stationary (and limiting) distribution the target dis- 
tribution)^] There exist fairly generic ways of produc- 
ing such chains, including the Metropolis-Hastings 
and Gibbs algorithms defined below. Besides the fact 
that stationarity of the target distribution is enough 
to justify a simulation method by Markov chain gen- 
eration, the idea at the core of MCMC algorithms is 
that local exploration, when properly weighted, can 
lead to a valid (global) representation of the distri- 
bution of interest. This includes for instance using 
only component-wise (and hence small-dimensional) 
simulations — that escape (to some extent) the curse 
of dimensionality — as in the Gibbs sampler. 

This very short introduction may give the impres- 
sion that MCMC simulation is only superficially dif- 
ferent from other Monte Carlo methods. When com- 
pared with alternatives such as importance sampling, 
MCMC methods differ on two issues: 



5 The theoretical foundations of MCMC algorithms are 
both sound and simple: as stressed by |Tierney| l |1994[ | and 
|Mengersen and Tweedie] jl996| l, the existence of a stationary 
distribution almost immediately validates the principle of a 
simulation algorithm based on a Markov kernel. 
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1. the output (0W) of an MCMC algorithm is only 
asymptotically distributed from the target distri- 
bution. While this usually is irrelevant, in that 
the flW's are very quickly distributed from that 
target, it may also happen that the algorithm 
fails to converge in the prescribed number of it- 
erations and thus that the resulting estimation 
is biased; 

2. the sequence (9^) being a Markov chain, the 
0W's are correlated and therefore this modifies 
the evaluation of the asymptotic variance as well 
as the effective sample size associated with the 
output. 

We also note here that trying to produce an iid se- 
quence out of a MCMC method is highly inefficient 
and thus not recommended. 



4.2 Metropolis— Hastings algorithm 

The Metropolis-Hastings algorithm truly is the 
generic MCMC method in that it offers a straight- 
forward and universal solution to the problem of 
simulating from an arbitrarjj^] posterior distribution 
7r(0|x) oc f(y\8) 7r(0): starting from an arbitrary 
point 6q, the corresponding Markov chain explores 
the surface of this posterior distribution using an in- 
ternal Markov kernel (proposal) q(8\8^ t ~ 1 ^) that pro- 
gressively visits the whole range of the possible values 
of 0. This internal Markov kernel should be irre- 
ducible with respect to the target distribution (that 
is, the Markov chain associate whith the proposal 
?(•]•) should be able to visit the whole support of the 
target distribution). The reason why the resulting 
chain does converge to the target distribution despite 
the arbitrary choice of q(8\8^ t ~ 1 ^) is that the pro- 
posed values are sometimes rejected by a step that 
relates with the accept-reject algorithm. 



°The only restriction is that this function is known up to a 
normalising constant. 



Metropolis Hastings Algorithm 
For a computing effort N 

1) Choose (o) , 

2) Set t = 1, 

3) Generate 8' from g(-|0 (i-1) ), 

4) Generate u from U]n t \\ • 

,s lf < <e')f{y\e')g{9^\e') 

' — ^(^^"^ J T(y|l9C*- 1 >), ? (i9n^ (t - :L) ) ' 
set 8^=8' else 0« =8^, 

6) Set t = t + 1, 

7) If t < N return to 3). 

A generic choice for q{8\8^ 1 ^) is the random walk 
proposal: = g(8-8 {t ^ 1) ) with a symmet- 

ric function g, which provides a simplified acceptance 
probability. Indeed, in that case, step 5) of the pre- 
vious algorithm is replaced with: if 

< <e')f{y\0') 

- Ao {t - l) )f{y\8^) ' 

set 8^=8' else 8^=8^. 

This ensures that values 8' that are more likely than 
the current fl*-' -1 -* are always accepted while values 
that are less likely are sometimes accepted. 

Example 9 (Continuation of Example [I]). If we con- 
sider the Pima Indian diabetes dataset with only its 
first two covariates, the parameter (3 is of dimension 
2 and the random walk proposal can be easily im- 
plemented. We use for g a normal distribution with 
covariance matrix the asymptotic covariance matrix 
£ of the MLE and the proposed value is then sim- 
ulated at iteration t as 

The MLE may also be used as starting value for 
the chain. Figure [4] illustrates the behaviour of 
the Metropolis-Hastings algorithm for this dataset, 
the lhs graph describing the path of the subchain 
^(ioot)-j an( ^ ^_ ne r ^ s ^tailing the first component 
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for 5000 < t < 6000. Although this is not clearly 
visible on the rhs graph, the acceptance rate of the 
algorithm is close to 50%, which means that half of 
the proposed /3's are rejected^ Using a covariance 
matrix that is five times larger leads to an accep- 
tance rate of 25%, while the larger 10S produces an 
acceptance rate of 15%. -4 



Finding the proper scale is not always as straight- 
forward as in Example [8] and asymptotic normal ap- 
proximations to the posterior distribution may be 
very inefficient. While the Metropolis-Hastings algo- 
rithm recovers better from facing large-dimensional 
problems than standard importance sampling tech- 
niques, this still is a strong limitation to its use in 
large-dimensional setups. 



4.3 Gibbs sampling 



In contrast, the alternative Gibbs sampler is an at- 
tractive algorithm for large-dimensional problems be- 
cause it naturally fits the hierarchical structures of- 
ten present in Bayesian models and more gener- 
ally in graphical and latent variable models. The 
fundamental strength of the Gibbs sampler is its 
ability to break a joint target distribution like 
7r(0i, . . . , 9 p \y) in the corresponding conditional dis- 
tributions 7Tj(0j|y, 9— i) (i = 1, . . . , n.) and to simulate 
successively from these low-dimensional targets: 



p-component systematic scan Gibbs sam- 
pler 

For a computing effort N 

1) Choose (o) , 

2) Set t = 1, 

3) Generate e[ t] from m (#i|y, fl^Y 

4) Generate 9^ from tt 2 (e 2 \y, 6 ( * \ 0^2 

5) ... 

6) Generate 9® ~ n p (# P |y, 0^ (p _ 1} 

7) Set t = t + l, 

8) If t < N return to 3). 



While this algorithm seems restricted to mostly hi- 
erarchical multidimensional models, the special case 
of the slice sampler (Robert and Casella, 2004, Chap- 
ter 8) shows that the Gibbs sampler applies in a wide 
variety of models. 

Example 10. (Continuation of Example [2]) As noted 
in Example [2j the probit model allows for a latent 
variable representation based on the artificial normal 
variable z t connected with the observed variable y t - 
This representation opens the door to a Gibbs sam- 



pler (Albert and Chib 1993) aimed at the joint pos- 



terior distribution of (/3,z) given y. Indeed, the con- 
ditional distribution of the latent variable Zt given f3 
and y t , 



Zt\yt,P 



AA+( X T/3,1,0) if 



Vt 



1 



AA_(xT/3,l,0) if y t = 0, 



(10) 



is clearly available^] The corresponding full condi- 
tional on the parameters is given by the standard 



7 This rate happens to be almost optimal for small dimen- 
sions JGelman et al.|1996l. 



8 Herc, JV+ (x^ /3, 1, 0) denotes the normal distribution with 
mean x.J f3 and variance 1 that is left-truncated at 0, while 
M— (xj /3, 1,0) denotes the symmetrical normal distribution 
that is right-truncated at 0. 
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Figure 4: Random-walk Metropolis-Hastings algorithm applied to the Pima Indian diabetes dataset. The 
left graph describes the path of the subchain (/3^ 100t ^ ') . The right graph shows the path of the first component 
chain fif> for 5000 < t < 6000. 



normal distribution (which does not depend on y) 4.4 Hybrid solutions 



/3\z~M 



-(X T X)- 1 X T z, 



Tv\-1 



1 



(X X X 



(11) 



Therefore, given the current value of /3, one cycle 
of the Gibbs algorithm produces a new value for z 



as simulated from the conditional distribution ( 10 ) 



which, when substituted into (111, produces a new 
value for j3. Although it does not impact the long- 
term properties of the sampler, the starting value of (3 
may once again be taken as the maximum likelihood 
estimate to avoid (useless) burning steps in the Gibbs 
sampler. 

The implementation of this Gibbs sampler is 
straightforward. There is no parameter to cali- 
brate (as opposed to the scale in the random-walk 
Metropolis-Hastings scenario). When comparing 
Figure [4] and [5J the raw plot of the sequence (/3 1 *' ) ) 
shows that the mixing behaviour of the Gibbs sam- 
pling chain is superior to the one for the Metropolis- 
Hastings chain. A 



Mixing both Metropolis-Hastings and Gibbs algo- 
rithms often result in better performances like faster 
convergence of the resulting Markov chain, the former 
algorithm being often used for global exploration of 
the target and the later for local improvement. 

A classic hybrid algorithm replaces a non- 
available Gibbs update by a Metropolis-Hastings 
step. Another hybrid solution alternates Gibbs and 
Metropolis-Hastings proposals. The corresponding 
algorithms arc valid: they produce ergodic Markov 
chains with the posterior target as stationary distri- 
bution. 

Example 11 (Continuation of Example [5]). For p — 
1, the probit model can be over-parameterised as 

V{Y t = l\ Xi ) = 1 - P(Y; = OK) = HxiP/a) , 

while only depending on /3/<r. Using a proper prior 
like 

^(/3,a 2 |x) = 7r(/3|x)7r(a 2 |x) 

cx cr~ 4 exp{-l/cr 2 } exp{-/3 2 /50) , 
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5400 5600 
Iterations 



Figure 5: Gibbs sampling algorithm applied to the Pima Indian diabetes dataset. The left graph describes 
the path of the subchain (f3^ wot ^). The right graph shows the path of the first component chain /3^' for 
5000 < t < 6000. 



the corresponding Gibbs sampler simulates (3 and a 2 
alternatively, from 

n 

7r03|x,y,(7) ocH^Xip/ay^i-Xip/af-V'TriPlx) 



and 



4=1 

respectively. Since both of these conditional dis- 
tributions are non-standard, we replace the direct 
simulation by one-dimensional Metropolis-Hastings 
stepsj^] using normal Af(j3^\ 1) and log-normal 
CAf (log .04) random walk proposals, respec- 
tively. (The scales were found by trial-and-error.) 
For a simulated dataset of 1, 000 points, the contour 
plot of the log-posterior distribution is given in Figure 
[6j along with the last 1, 000 points of a corresponding 
MCMC sample after 100, 000 iterations. This graph 



9 In this Metropolis-within-Gibbs strategy, note that a sin- 
gle step of a Metropolis-Hastings move is sufficient to validate 




Figure 6: Contour plot of the log-posterior distri- 
bution for a probit sample of 1,000 observations, 
along with 1, 000 points of an MCMC sample (Source: 
Robert and Casella 2004)- 



shows a very satisfactory repartition of the simulated 
parameters over the likelihood surface, with higher 
concentrations near the largest posterior regions. -4 

Let us note as a conclusion to this short section 
that an alternative meaning for hybrid solutions is the 



the algorithm, since stationarity, not convergence, is the issue, simultaneous use of different Markov kernels (Tierney 
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1994). A mixture of MCMC kernels does remain an 
MCMC kernel with the same stationary distribution 
and its performances are at least as good as the best 
component in the mixture. There is therefore very 
little to say against advocating this extension. 



4.5 Scaling and adaptivity 



A difficulty with Metropolis-Hastings algorithms, in- 
cluding random walk versions, is the calibration of 
the proposal distribution: this proposal must be suf- 
ficiently related to the target distribution so that, in a 
reasonable number of steps, the whole support of this 
distribution can be visited. If the scale of the random 
walk proposal is too small, this will not happen as the 
algorithm stays "too local" and, if for instance there 
are several modes on the target, the algorithm may 
remain trapped within one modal region because it 
cannot reach other modal regions with jumps of too 
small a magnitude. 



Example 12. For a sample y±, 
ture distribution 



from the mix- 



pA%i,a 2 ) + (l-p)A^2,<7 2 ) 

where both p and a 2 are known, the posterior distri- 
bution associated with the prior Af(0, 10cr 2 ) on both 
Hi and fi2 is multimodal, with a major mode close to 
the true value of \i\ and fi2 (when n is large enough) 
and a secondary and spurious mode (that stems from 
the nonindcntifiable case p — 0.5). When running a 
random walk Metropolis-Hastings algorithm on this 
model, with a normal proposal ^((/^ , a4 )> T ^-2), a 
small scale r prevents the Markov chain from visit- 
ing the major mode. Figure [7] compares two choices 
of t for the same dataset: for r = 1, the spurious 
mode can be escaped but for t = .3 the chain re- 
mains trapped in that starting mode. -4 





I 










Figure 7: Evolution of a random walk Metropolis- 
Hastings chain on a mixture log-posterior surface for 
n = 500 observations and (top) r = 1 and 1,000 
iterations; (bottom) r = .3 and 10,000 iterations. 



The larger the dimension p is, the harder the de- 
termination of the scale is, because 



a. the curse of dimensionality implies that there is 
an increasingly important part of the space with 
zero probability under the target; 
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b. the knowledge and intuition about the modal re- 
gions get weaker (for complex distributions, it 
is impossible to identify none but a few of the 
modes); 

c. the proper scaling of a random walk proposal in- 
volves a symmetric (p,p) matrix. Even when 
diagonal, this matrix gets harder to scale as 
the dimension increases (unless one resorts to a 
Gibbs like implementation, where each direction 
is scaled separately). 

In addition to these difficulties, learning about the 
specificities of the target distribution while running 
an MCMC algorithm and tuning the proposal ac- 
cordingly, i.e. constructing an adaptive MCMC pro- 
cedure, is difficult because this cancels the Markov 
property of the original method and thus jeopar- 
dizes convergence. For instance, Figure [8] shows the 
discrepancy between an histogram of a simulated 
Markov chain and the theoretical limit (solid curve) 
when the proposal distribution at time T is a kernel 
approximation based on the first T—l simulations of 
the "chain" . Similarly, using an on-line scaling of the 
algorithm against the empirical acceptance rate in or- 



der to reach a golden number like 0.234 (see Robert 
and CaselTa||2004| Note 7.8.4) is inherently flawed in 



that the attraction of a modal region may give a false 
sense of convergence and may thus lead to a choice 
of too small a scale, simply because other modes will 
fail to be visited during the scaling experiment. 

However, there are algorithms that preserve ergod- 
icity (convergence to the target) while implementing 
adaptivity. See, e.g., Gilks et al. (19981 who use re- 



generation to create block independence and preserve 
Markovianity on the paths rather than on the values, 



Haario et al. ( 1999 2001 1 who derive a proper adap- 



tation scheme by using a ridge-like correction to the 
empirical variance in very large dimensions for satel- 



lite imaging data, and Andrieu et al. (2005) who pro- 



pose a general framework of valid adaptivity based 
on stochastic optimisation and the Robbin-Monro al- 
gorithm. 



More recently, Roberts and Rosenthal (2007) con 



sider basic ergodicity properties of adaptive Markov 
chain Monte Carlo algorithms under minimal as- 
sumptions, using coupling constructions. They 





Figure 8: Sample (left) produced by 50, 000 iter- 
ations of a nonparametric adaptive MCMC scheme 
and comparison (right) of its distribution with the 
target distribution (solid curve). (Source: Robert 

loM) 



prove convergence in distribution and a weak law of 
large numbers. Moreover, in |Roberts and Rosen-| 
thai (2009), they investigate the use of adaptive 



MCMC algorithms to automatically tune the Markov 
chain parameters during a run. Examples include 
the adaptive Metropolis multivariate algorithm of 



Haario et al. ( 2001 ) , Metropolis- within-Gibbs algo- 



rithms for nonconjugate hierarchical models, region- 
ally adjusted Metropolis algorithms, and logarith- 



mic scalings. Roberts and Rosenthal (20091 present 



some computer simulation results that indicate that 
the algorithms perform very well compared to non- 
adaptive algorithms, even in high dimensions. 



5 Approximate Bayesian com- 
putation techniques 

There exist situations where the likelihood function 
f{y\9) is overly expensive or even impossible to cal- 
culate, but where simulations from the density f{y\0) 
are reasonably produced. A generic class of such sit- 
uations is made by latent variable models where the 
analytic integration of the latent variables is impossi- 
ble, while handling the latent variables as additional 
parameters in a joint distribution causes any MCMC 
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to face convergence problems. Another illustration 
is given by inverse problems where computing the 
function f(y\9) for a given pair (y, 9) involves solv- 
ing a numerical equation. In such cases, it is almost 
impossible to use the computational tools presented 
in the previous section to sample from the posterior 
distribution n(8\y). Approximate Bayesian compu- 
tation (ABC) is an alternative to such techniques that 
only requires being able to sample from the likelihood 
f{-\9). It was first proposed for population genetic 
models ( Beaumont et al. 2002 1 but applies in much 



wider generality. 



Likelihood free rejection sampling 
For a computing effort N 

1) Set i = 1, 

2) Generate 9' from the prior distribution tt(-), 

3) Generate z from the likelihood f(-\9 ), 

4) If p(r)(z), rj(y)) < e, set 8i = d' and i = i + 1, 

5) If i < N, return to 2). 



This likelihood free algorithm samples from the 
marginal in z of the following joint distribution: 

7r e (0,z|y)cx^(0)/(z|0)W(z) 
with the tuning parameters as 

• p(-, •) a distance, 

• ry(-) a summary statistic, 

• e > a tolerance level, 

. P^ = {z\p( V (z), r?(y))<e}. 



The idea behind ABC ( |Beaumont et al.|2002[ ) is that 
the summary statistics coupled with a small tolerance 
should provide a good approximation of the posterior 
distribution: 

7T e (0|y) = J n e (9,z\y)dz^n(9\y). 



algorithm that samples from 7r £ (#,z|y), and the 
marginally from 7r e (<?|y); this algorithm only requires 
the ability to sample from f(-\9). This is the likeli- 
hood free MCMC sampler: 



Likelihood free MCMC sampler 
For a computing effort N 

1) Use the likelihood free rejection sampling to get 
a realization 9^°' from the ABC target distribu- 
tion 7T e (0|y), 

2) Set t = 1, 

3) Generate 9' from the Markov kernel 



4) Generate z from the likelihood f{-\8 ), 

5) Generate u from W[ 01 ] , 

n(9')q(9^\9') 

6) Itm< — — — llp e , y (z), 

7 r(6» ( ^ 1) )g(6»'|0 (t - 1) ) 

set 0®=& else 8^ = 9^, 

7) Set t = t+ 1, 

8) If i < JV return to 3). 



Rejection sampling and MCMC methods can per- 
form poorly if the tolerance level e is small. Con- 
sequently various sequential Monte Carlo algorithms 
have been constructed as an alternative to both meth- 
ods. For instance, Beaumont et al. (2009) proposed 



It has been shown in Marjoram et al. (2003) that 



it is possible to construct a Metropolis-Hastings 



an ABC version of the Population Monte Carlo algo- 
rithm presented above. The key idea is to decompose 
the difficult issue of sampling from 7r e (0, z|y) into a 
series of simpler subproblems. The algorithm begins 
at time sampling from 7r eo (0, z|y) with a large value 
£0) then simulating from an increasing difficult se- 
quence of target distribution 7r Ct (9, z|y), that is when 
et < et-i. 

Example 13 (Continuation of Example [8]). Figure[9] 
provides an illustration of the above algorithm when 
applied to the probit model with the three covari- 
ates described in Example [TJ In this artificial case, 
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6 Final remarks 




Figure 9: Comparison between density estimates of 
the marginal posterior distributions of fa (left), fa 
(center) and fa (right) obtained by ABC (in red) and 
MCMC samples (in black) in the setup of the Pima 
Indian diabetes study. 



This tutorial is necessarily incomplete and biased: 
the insistance on model choice and on variable di- 
mension models is also a reflection of the author's 
own interests. Others would have rather chosen to 
stress the relevance of these simulation methods for 
optimal design |MMer|1999[|Muller et al.|2004 



jonction with simulated annealing (e.g. Andrieu and 



Doucet 2000 


Doucet et al.|2002|) 


regression ( 


Holmes et al. 20021 



m con- 



of continuous time stochastic processes (Dellaportas 



et al. 2004 Beskos et al. 2006). That such a wealth 



of choices is available indicates that the field still un- 
dergoes a formidable expansion that should benefit a 
wide range of areas and disciplines and, conversely, 
that the continued attraction of new areas within the 
orbit of Bayesian computational methods backfeeds 
their creativity by introducing new challenges and 
new paradigms. 
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the ABC outcome can be compared with the MCMC 
"exact" simulation described above and the result 
is striking in that the ABC approximation is con- 
founded with the exact posterior densities. The tun- 
ing of the ABC algorithm is to use 10 6 simulations 
over 10 iterations, with bounds e t set as the 1% quan- 
tile of the simulated p(n(z),n(y)), p chosen as the 
Euclidean distance, and ry(z) as the predictive dis- 
tribution based on the current parameter fa made of 
the Q(xJ fa's, while ri(y) is the predictive distribution 
based on the MLE fay) made of the $(cc^/3(y))'s. In 
this special case we are therefore avoiding the sim- 
ulation of the observations themselves as predictive 
functions are available. This choice reduces the vari- 
ability in the divergence between rj(z,) and n(y), and 
explains for the very good fit between the densities. 
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